Loading necessary libraries

library(ISLR2)
library(car)
library(mgcv)
library(rgl)
library(splines)
library(pbapply)

Moving beyond simple (nonparametric) regression

In the past two labs we have seen several methods that allow us to fit a smooth regression function \[y_i=f(x_i)+\varepsilon_i, \quad i=1,\ldots,N\] with \(x_i\in \mathbb{R}\). In today’s lab we move forward and look at a way to perform multivariate nonparametric regression, that is, the case \(x_i\in \mathbb{R}^p\), \(p>1\). We will do so by means of an additive approach: Generalized Additive Models (GAMs). Let us go back to our well-known Prestige dataset, but this time considering the education variable as well.

with(Prestige, scatterplotMatrix(data.frame(prestige, education, income)))

The relationship between prestige and education seems fairly linear. By contrast, as we have extensively experienced, prestige and income are related in a markedly nonlinear way.
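
A quick numerical complement (our addition, not part of the original lab): comparing Pearson and Spearman correlations for income corroborates this impression, since a sizeable gap between the two hints at a monotone but nonlinear relationship.

with(Prestige, c(pearson  = cor(prestige, income),
                 spearman = cor(prestige, income, method = "spearman")))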

First off, let’s try to fit a multivariate linear model

model_lm=lm(prestige ~ education + income, data=Prestige)
summary(model_lm)
## 
## Call:
## lm(formula = prestige ~ education + income, data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4040  -5.3308   0.0154   4.9803  17.6889 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.8477787  3.2189771  -2.127   0.0359 *  
## education    4.1374444  0.3489120  11.858  < 2e-16 ***
## income       0.0013612  0.0002242   6.071 2.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.81 on 99 degrees of freedom
## Multiple R-squared:  0.798,  Adjusted R-squared:  0.7939 
## F-statistic: 195.6 on 2 and 99 DF,  p-value: < 2.2e-16

Building prediction surfaces is as simple as building prediction curves was in the univariate case

education.grid=seq(range(Prestige$education)[1],range(Prestige$education)[2],length.out = 100)
income.grid=seq(range(Prestige$income)[1],range(Prestige$income)[2],length.out = 100)
grid=expand.grid(education.grid,income.grid)
names(grid)=c('education','income')
pred=predict(model_lm,newdata=grid) 
persp3d(education.grid,income.grid,pred,col='blue',border="black",lwd=0.3)
with(Prestige, points3d(education,income,prestige,col='black',size=5))

What happens if I add the interaction?

model_lm_interaction <- lm(prestige ~ education + income + education:income, data=Prestige) 
# model_lm_interaction <- lm(prestige ~ education * income, data=Prestige) # alternatively 
summary(model_lm_interaction)
## 
## Call:
## lm(formula = prestige ~ education + income + education:income, 
##     data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0026  -5.1918   0.3768   4.5915  19.9312 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.207e+01  6.581e+00  -3.353 0.001136 ** 
## education         5.373e+00  5.797e-01   9.270 4.65e-15 ***
## income            3.944e-03  1.007e-03   3.918 0.000165 ***
## education:income -1.961e-04  7.460e-05  -2.628 0.009958 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.587 on 98 degrees of freedom
## Multiple R-squared:  0.8113, Adjusted R-squared:  0.8055 
## F-statistic: 140.5 on 3 and 98 DF,  p-value: < 2.2e-16

I manage to capture a bit of the nonlinearity with the interaction term… Can we do better?

pred_interaction=predict(model_lm_interaction,newdata=grid)
persp3d(education.grid,income.grid,pred_interaction,col='grey30')
with(Prestige,points3d(education,income,prestige,col='black',size=5))

We certainly can, by employing Generalized Additive Models! The main reference (also surprisingly easy and straightforward to read) is Generalized Additive Models: An Introduction with R by Simon N. Wood. The mgcv package accompanies the book; it is a very well structured and actively maintained package that provides smoothness estimation for a variety of models: we will only scratch the surface today! Recall the analytical expression of a GAM:

\[\begin{aligned} y_{i} &=\beta_{0}+\sum_{j=1}^{p} f_{j}\left(x_{i j}\right)+\epsilon_{i} \\ &=\beta_{0}+f_{1}\left(x_{i 1}\right)+f_{2}\left(x_{i 2}\right)+\cdots+f_{p}\left(x_{i p}\right)+\epsilon_{i} \end{aligned}\]

It is called an additive model because we calculate a separate \(f_{j}\) for each \(X_{j}\), and then add together all of their contributions. Operatively, gam() is the function of the mgcv package for fitting these types of models. It works very much like the glm function for generalized linear models, but it additionally allows us to specify diverse smoothing options for the model terms.

model_gam=gam(prestige ~ s(education,bs='cr') + s(income,bs='cr'),data = Prestige)

The s() function basically builds a “smooth” term for each covariate we put in. The behavior is very similar to the smooth.spline function we saw last time: no need to set the number of knots, nor (as in the gam package) the equivalent degrees of freedom; mgcv will take care of everything for you. With bs we indicate the (penalized) smoothing basis to use: cr provides a cubic spline basis defined by a modest-sized set of knots spread evenly through the covariate values. The fitting is based on penalized likelihood, where the penalty is the conventional integrated squared second derivative of the smooths.
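
As an illustrative aside (our addition; k = 20 is an arbitrary choice), the basis dimension can be overridden through the k argument of s(), and the smoothing parameters that mgcv selects via GCV can be inspected directly:

model_gam$sp  # smoothing parameters selected by GCV
model_gam_k20 <- gam(prestige ~ s(education, bs = 'cr', k = 20) +
                       s(income, bs = 'cr', k = 20), data = Prestige)
model_gam_k20$sp  # typically a very similar fit, with a larger basis dimension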

Let us look at the model’s summary

summary(model_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## prestige ~ s(education, bs = "cr") + s(income, bs = "cr")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.8333     0.6884   68.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(education) 3.154  3.913 39.22  <2e-16 ***
## s(income)    2.996  3.604 16.68  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.837   Deviance explained = 84.7%
## GCV = 51.984  Scale est. = 48.34     n = 102

If you are familiar with the output of a glm object, the one above should not surprise you that much. The only slightly exotic part may be the “approximate significance of smooth terms”: the effective degrees of freedom (edf) are a GAM summary statistic reflecting the degree of nonlinearity of each estimated curve (edf close to 1 indicates an essentially linear effect). Notice that in this case the adjusted R-squared is higher than in the linear models, even without an interaction term.
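
For instance (a small illustrative addition), the edf of each smooth can be pulled out of the summary object:

summary(model_gam)$s.table[, "edf"]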

Let us look at the residuals:

hist(model_gam$residuals)

qqnorm(model_gam$residuals)

Normality test:

shapiro.test(model_gam$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_gam$residuals
## W = 0.98887, p-value = 0.5603

Not bad at all! Notice that mgcv works with smoothing bases; nevertheless, the beauty of GAMs is that we can use any type of univariate smoother as a building block for fitting an additive model. We might, for example, like to employ natural cubic splines to provide the nonparametric components of our GAMs. In this case, we do not actually need anything new beyond the good old lm function and the basis matrix generator contained in the splines package.

model_gam_ns <-
  lm(prestige ~ ns(education, df = 3) + ns(income, df = 3), data = Prestige)

In most situations, the differences between the GAMs obtained using smoothing splines and those using natural splines are small. Let us look at the residuals scatterplot for the two models:

plot(model_gam_ns$residuals,model_gam$residuals)

cor(model_gam_ns$residuals,model_gam$residuals)
## [1] 0.988401

Pretty much the same fit!
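
One more way to compare the two fits (our addition) is through their adjusted R-squared values:

c(gam_smoothing      = summary(model_gam)$r.sq,
  lm_natural_splines = summary(model_gam_ns)$adj.r.squared)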

GAMs make it easy to interpret the effect of each regressor on the response variable. Again, it works very much like the analysis of a linear model: we look at the contribution of a single predictor holding all the others fixed. This is graphically best accomplished with the plot method conveniently provided by the mgcv package.

plot(model_gam)

The same can be achieved for the natural splines model with a tiny workaround (programming tip!): the plot.Gam function from the gam package also handles lm fits containing spline terms.

gam::plot.Gam(model_gam_ns, se=TRUE)

Again, we see that there is basically no difference between model_gam and model_gam_ns.

We noticed at the very beginning that the relationship between prestige and education seems fairly linear. We now test whether a smooth function is needed for education, or whether we can be satisfied with a linear contribution. The reduced model reads:

model_gam_reduced=gam(prestige ~ education + s(income,bs='cr'),data = Prestige)
summary(model_gam_reduced)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## prestige ~ education + s(income, bs = "cr")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.1342     3.7355   1.107    0.271    
## education     3.9764     0.3415  11.644   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df     F p-value    
## s(income) 3.353  4.012 15.35  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.825   Deviance explained = 83.3%
## GCV = 54.563  Scale est. = 51.7      n = 102

The linear contribution of education is highly significant, but which model is better? Let us perform a formal test, just as we would for nested model selection with lm: the anova function serves the purpose.

anova(model_gam_reduced,model_gam, test = "F")

Of course, we could also compute it by hand:

N <- nrow(Prestige)
RSS_reduced <- deviance(model_gam_reduced)
RSS_full <- deviance(model_gam)
# column 2 of s.table holds the reference degrees of freedom (Ref.df) of each smooth
df_full <- N - sum(summary(model_gam)$s.table[, 2]) - 1 # -1 for the intercept
df_reduced <- N - sum(summary(model_gam_reduced)$s.table[, 2]) - 2 # -2 for the intercept and the linear term for education
df_difference <- df_reduced - df_full
F_value_manual <- ((RSS_reduced - RSS_full) / df_difference) / (RSS_full / summary(model_gam)$residual.df)
pf(F_value_manual, df1 = df_difference, df2 = df_full, lower.tail = FALSE)
## [1] 0.02791939
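
As a sanity check (our addition), the same p-value can be extracted directly from the anova table; it should agree with the manual computation up to the degrees-of-freedom convention used by anova.gam.

anova(model_gam_reduced, model_gam, test = "F")$`Pr(>F)`[2]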

It seems that a nonlinear component for education is needed after all. Important: had the residuals not been normal, we would have needed to perform a nonparametric test instead. You now know how to do that, right? A sketch follows; more on this later.
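
Here is a minimal sketch of one possible nonparametric alternative: a permutation test built on the reduced-model residuals (a Freedman-Lane-style scheme, assuming the residuals are exchangeable under the null; B_perm, the seed and the y_perm name are our own arbitrary choices, not part of the lab).

set.seed(42)
B_perm <- 200
T_obs <- RSS_reduced - RSS_full # observed gain in fit of the full model
fitted_red <- fitted(model_gam_reduced)
res_red <- residuals(model_gam_reduced)
T_perm <- replicate(B_perm, {
  perm_df <- Prestige
  perm_df$y_perm <- fitted_red + sample(res_red) # data generated under H0
  red  <- gam(y_perm ~ education + s(income, bs = 'cr'), data = perm_df)
  full <- gam(y_perm ~ s(education, bs = 'cr') + s(income, bs = 'cr'), data = perm_df)
  deviance(red) - deviance(full)
})
mean(T_perm >= T_obs) # permutation p-value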

Performing prediction for GAMs is as simple as it always has been:

pred_gam=predict(model_gam,newdata=grid)

persp3d(education.grid,income.grid,pred_gam,col='grey30')
with(Prestige,points3d(education,income,prestige,col='black',size=5))

We can include interactions in a GAM as well: we can either employ a simpler approach (the one we use below), defining an interaction term and smoothing it, or employ more sophisticated bidimensional splines, such as thin plate splines (Wood, 2003). In general, it is better to consider bivariate splines when the two dimensions are tightly related (e.g., coordinates in space).

model_gam_inter = gam(prestige ~ s(education, bs = 'cr') + 
                        s(income, bs ='cr') + 
                        s(I(income * education), bs = 'cr'),
                      data = Prestige)
# model_gam_inter=gam(prestige ~ s(education,bs='cr') + s(income,bs='cr')+ s(education,income,bs='tp'),data = Prestige) # thin plate spline
pred_inter = predict(model_gam_inter, newdata = grid) # the I(income * education) term is evaluated from the grid columns

persp3d(education.grid, income.grid, pred_inter, col = 'grey30')
with(Prestige,
     points3d(education, income, prestige, col = 'black', size = 5))
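
For completeness (our addition, not covered in the lab), mgcv also offers tensor product smooths via te(). These are scale invariant, which makes them a sensible choice when the covariates are measured in very different units, as education (years) and income (dollars) are here.

model_gam_te <- gam(prestige ~ te(education, income), data = Prestige)
pred_te <- predict(model_gam_te, newdata = grid)
persp3d(education.grid, income.grid, pred_te, col = 'grey30')
with(Prestige, points3d(education, income, prestige, col = 'black', size = 5))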

Exam-like exercise 1 (2021/01/22)

Dr. Simoni, a data scientist, just had his second kid (a boy, 58 cm tall) and he is interested in knowing how tall his child will be at 25 years of age, given his height at birth. To do so, he has collected the height at birth of 100 males, alongside the height they reached when they were 25. He has collected these data in the file boyheight.rda: height.25 is the height [cm] at 25 years of age, while height.b is the length of the newborn [cm]. Assume that \[height.25 = f (height.b) + \varepsilon.\]

  1. Build a degree 1 regression spline model, with breaks at the 25th and 75th percentiles of height.b, to predict the height at 25 from the height at birth. Provide a plot of the regression line, compute the pointwise prediction (rounded to the second decimal digit) for the height at 25 of Dr. Simoni’s newborn child, and calculate, using a bootstrap approach on the residuals, the bias, variance and Mean Squared Error (MSE) of such prediction.

  2. Dr. Simoni is not particularly satisfied with the predictions of such a simple model, and would like you to do more. Build a prediction model for the height at 25 based on a smoothing spline of order 4 (selecting the lambda parameter via Leave-One-Out CV). Report the optimal lambda value (2 decimal digits), provide a plot of the regression line alongside the pointwise prediction of the height at 25 of Dr. Simoni’s kid, and calculate, using a bootstrap approach on the residuals, the bias, variance and MSE of such prediction (fixing the lambda value to the one obtained via Leave-One-Out CV).

Solution

load(here::here("Block III - Nonparametric regression/data/boyheight.rda"))
plot(x=height.b, y=height.25)

# Breaks at the 25th and 75th percentile of *height.b*
  
knots_height <- quantile(height.b, probs = c(.25, .75))

# Build a degree 1 regression spline model

fit_spline <- lm(height.25 ~ bs(height.b, knots = knots_height, degree = 1))
new_data_seq <- seq(min(height.b), max(height.b), length.out = 100)

# Predict the height at 25 from the height at birth
preds <- predict(fit_spline, newdata = list(height.b = new_data_seq), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
# Plot of the regression line
plot(y = height.25, x = height.b, cex = .5, col = "darkgrey")
lines(new_data_seq, preds$fit, lwd = 2, col = "blue")
matlines(new_data_seq, se.bands, lwd = 1, col = "blue", lty = 3)

# Pointwise prediction
simoni_height <- 58
(pointwise_pred <- predict(fit_spline, newdata = list(height.b = simoni_height)))
##        1 
## 191.4279
# Bias, variance and Mean Squared Error (MSE) of pointwise_pred
# wrapper_f reads fitted, residuals, n, knots_height and simoni_height from the global environment
wrapper_f <- function(){
  height.25.boot <- fitted + sample(residuals, n, replace = TRUE)
  new_model <- lm(height.25.boot ~ bs(height.b, knots = knots_height, degree = 1))
  predict(new_model, newdata = list(height.b = simoni_height))
}

B <- 200
fitted=fit_spline$fitted.values
residuals=fit_spline$residuals
n <- length(height.25)
set.seed(100)

boot_d <- pbreplicate(n = B,expr = wrapper_f(),simplify = "vector")
(variance_pred <- var(boot_d))
## [1] 1.158379
(bias_pred <- pointwise_pred-mean(boot_d))
##          1 
## 0.04960756
(MSE_pred <- variance_pred +bias_pred^2)
##       1 
## 1.16084
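
As an optional extra (not required by the exercise), the same bootstrap draws also yield a 95% percentile interval for the prediction:

quantile(boot_d, probs = c(0.025, 0.975))
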
# smoothing spline of order 4 (cv=TRUE for LOOCV)
fit_smooth <- smooth.spline(x = height.b, y=height.25,cv = TRUE)

# optimal lambda value
round(fit_smooth$lambda,2)
## [1] 0.01
# plot of the regression line
plot(y = height.25, x = height.b, cex = .5, col = "darkgrey")
lines(fit_smooth$x, fit_smooth$y, col="blue", lwd=2)

fitted_smooth <- predict(fit_smooth, height.b)$y
fitted_res <- height.25-fitted_smooth

pred_smooth_simoni <- predict(fit_smooth, simoni_height)
# pointwise prediction of the height at 25 of Dr. Simoni’s kid
pred_smooth_simoni$y
## [1] 191.9669
boot_smooth <- numeric(B)
set.seed(100)
for(b in 1:B){
  # note: the loop variable must not be called "sample", or it shadows the sample() function
  height.25.boot <- fitted_smooth + sample(fitted_res, n, replace = TRUE)
  new_model <- smooth.spline(y = height.25.boot, x = height.b, lambda = fit_smooth$lambda)
  boot_smooth[b] <- predict(new_model, simoni_height)$y
}


(variance_pred_smooth <- var(boot_smooth))
## [1] 0.9159435
(bias_pred_smooth <- pred_smooth_simoni$y-mean(boot_smooth))
## [1] -0.3520232
(MSE_pred_smooth <- variance_pred_smooth +bias_pred_smooth^2)
## [1] 1.039864
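
A quick side-by-side look (our addition, not part of the exam solution) shows that the smoothing spline attains a slightly lower estimated MSE than the degree 1 regression spline:

c(regression_spline = unname(MSE_pred), smoothing_spline = MSE_pred_smooth)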

Exam-like exercise 2 (2021/06/14)

Dr. Bisacciny, Ph.D., is becoming increasingly worried about the fact that the bees he decided to place near a corn field close to the Italian city of Milan tend to die way earlier than the ones placed close to the small village of Jovençan, in Aosta Valley. He suspects that additional factors influencing the survival time of a bee are its productivity and its weight. For this reason, he decides to run an experiment in which he selects 10,000 bees, placing 5,000 of them in Aosta Valley and 5,000 in Milan. Dr. Bisacciny runs the experiment for 50 days, during which he annotates when a bee passes away. After the end of the experiment, he starts to analyse the data. In the file ex02.rda you can find a dataframe with information about the status of the bee (1 if alive, 2 if dead), the survival time of the bee (surv.time), its weight (weight) and productivity (prod), as well as its location, Jovençan or Milan. Modeling these data as i.i.d. realizations of a four-dimensional random variable:

  1. Build an additive model for log(surv.time), where log() is the natural logarithm, using cubic B-spline terms for the main effects, a dummy term for location and no interactions. After having written the estimated additive model in proper mathematical terms, report the adjusted \(R^2\) and the p-values of the tests. Comment on the results (assume the residuals to be normal).
  2. After having reduced the model to its significant terms, use this model to provide a pointwise prediction of the average log-survival time for a bee living in Milan or in Jovençan. Is it necessary to use the gam machinery to estimate the reduced model?

Solution

load(here::here("Block III - Nonparametric regression/data/ex02.rda"))
head(bee)
fit_gam <- gam(log(surv.time)~s(weight,bs = "cr")+s(prod,bs = "cr")+location,data = bee)

Estimated model

\[\log(surv.time) =\beta_{0}+f_{1}\left(weight\right)+f_{2}\left(prod\right)+\beta_1 \, location_{Milan}+\epsilon\]

table_fit_gam <- summary(fit_gam)
# adjusted R2
(r_2_squared <- table_fit_gam$r.sq)
## [1] 0.4569872
# p-values of the tests (parametric coefficients)
table_fit_gam$p.pv
##   (Intercept) locationMilan 
##             0             0
# p-values of the tests (smooth terms)
table_fit_gam$s.pv
## [1] 0.1002523 0.4254312
plot(fit_gam)

hist(fit_gam$residuals)

# Remove weight
fit_gam_reduced_1 <- gam(log(surv.time)~s(prod,bs = "cr")+location,data = bee)
anova(fit_gam_reduced_1,fit_gam, test = "F")
# Remove prod
fit_gam_reduced_2 <- gam(log(surv.time)~location,data = bee)
anova(fit_gam_reduced_2, fit_gam_reduced_1, test = "F")

We can use a simple linear model with just one term, location: no need for the gam machinery anymore.

fit_lm <- lm(log(surv.time)~location,data = bee)

# Avg log surv time in Milan
sum(fit_lm$coefficients)
## [1] 1.168164
# mean(log(bee$surv.time[bee$location=="Milan"])) # alternatively

# Avg log surv time in Jovencan
fit_lm$coefficients[1]
## (Intercept) 
##    3.010891
# mean(log(bee$surv.time[bee$location=="Jovencan"])) # alternatively

A more concise way, using tapply:

tapply(log(bee$surv.time), bee$location, mean)
## Jovencan    Milan 
## 3.010891 1.168164